107 research outputs found
Cascade Learning by Optimally Partitioning
The cascaded AdaBoost classifier is a well-known, efficient object detection
algorithm. The cascade structure has many parameters to be determined. Most
existing cascade learning algorithms are designed by assigning a detection rate
and a false positive rate to each stage, either dynamically or statically. Their
objective functions are not directly related to minimum computation cost, so these
algorithms are not guaranteed to reach an optimal solution in the sense of
minimizing computation cost. On the assumption that a strong classifier is
given, in this paper we propose an optimal cascade learning algorithm (which we
call iCascade) that iteratively partitions the strong classifier into two parts
until a predefined number of stages is generated. iCascade searches for the optimal
number r_i of weak classifiers in each stage i by directly minimizing the
computation cost of the cascade. Theorems are provided to guarantee the
existence of a unique optimal solution, and theorems are also given for the
proposed efficient algorithm for searching the optimal parameters r_i. Once a new
stage is added, the parameter r_i of each stage decreases gradually as the
iteration proceeds, which we call the decreasing phenomenon. Moreover, with the
goal of minimizing computation cost, we develop an effective algorithm for
setting the optimal threshold of each stage classifier. In addition, we prove
in theory why a newly added stage requires more new weak classifiers than the
last stage. Experimental results on face detection demonstrate the effectiveness
and efficiency of the proposed algorithm.
Comment: 17 pages, 20 figures
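A minimal sketch (not the authors' implementation) of the core idea: given a strong classifier with T weak classifiers, stage boundaries are inserted one at a time so that the expected number of weak-classifier evaluations per window is minimized. The pass_rate function, the greedy insertion, and all numbers below are illustrative assumptions.

```python
def expected_cost(splits, pass_rate, T):
    """Expected number of weak-classifier evaluations per window for a given cascade."""
    cost, survive, prev = 0.0, 1.0, 0
    for r in list(splits) + [T]:
        cost += survive * (r - prev)   # windows reaching this stage evaluate weaks prev+1..r
        survive = pass_rate(r)         # fraction of windows passing the first r weak classifiers
        prev = r
    return cost

def icascade_like(T, n_stages, pass_rate):
    """Iteratively add split points, each time choosing the one with the lowest expected cost."""
    splits = []
    for _ in range(n_stages - 1):
        candidates = [r for r in range(1, T) if r not in splits]
        best = min(candidates,
                   key=lambda r: expected_cost(sorted(splits + [r]), pass_rate, T))
        splits = sorted(splits + [best])
    return splits

# toy pass-rate model: most non-object windows are rejected by the first few weak classifiers
toy_pass_rate = lambda r: 0.5 ** (r / 10.0)
print(icascade_like(T=200, n_stages=4, pass_rate=toy_pass_rate))
```

In this simplified cost model, a deeper stage is only paid for by the windows that survive the earlier stages, which is why early, short stages reduce the expected cost.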
Pedestrian Detection Inspired by Appearance Constancy and Shape Symmetry
The discrimination and simplicity of features are very important for
effective and efficient pedestrian detection. However, most state-of-the-art
methods are unable to achieve a good tradeoff between accuracy and efficiency.
Inspired by some simple inherent attributes of pedestrians (i.e., appearance
constancy and shape symmetry), we propose two new types of non-neighboring
features (NNF): side-inner difference features (SIDF) and symmetrical
similarity features (SSF). SIDF characterizes the difference between the
background and the pedestrian, as well as the difference between the pedestrian
contour and its inner part. SSF captures the symmetrical similarity of the
pedestrian shape. By contrast, it is difficult for neighboring features to have
such characterization abilities. Finally, we propose to combine both non-neighboring
and neighboring features for pedestrian detection. It is found that
non-neighboring features can further decrease the average miss rate by 4.44%.
Experimental results on INRIA and Caltech pedestrian datasets demonstrate the
effectiveness and efficiency of the proposed method. Compared with
state-of-the-art methods that do not use a CNN, our method achieves the best
detection performance on Caltech, outperforming the second-best method (i.e.,
Checkerboards) by 1.63%.
Comment: 9 pages, 17 figures
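A rough illustration of the two non-neighboring feature types on a single feature channel. The block positions, sizes, and the use of block means are invented for the example, not the paper's exact layout: a side-inner difference compares a block on the window border with a block inside the pedestrian, and a symmetrical similarity compares a block with its horizontally mirrored counterpart.

```python
import numpy as np

def block_mean(channel, y, x, h, w):
    return channel[y:y + h, x:x + w].mean()

def side_inner_difference(channel, side_blk, inner_blk):
    """SIDF-style feature: mean difference between a side block and an inner block."""
    return block_mean(channel, *side_blk) - block_mean(channel, *inner_blk)

def symmetrical_similarity(channel, blk):
    """SSF-style feature: (negated) difference between a block and its mirrored block."""
    y, x, h, w = blk
    H, W = channel.shape
    mirrored = (y, W - x - w, h, w)                  # horizontally mirrored block
    return -abs(block_mean(channel, *blk) - block_mean(channel, *mirrored))

window = np.random.rand(128, 64)                     # one channel of a detection window
f1 = side_inner_difference(window, side_blk=(40, 0, 16, 8), inner_blk=(40, 28, 16, 8))
f2 = symmetrical_similarity(window, blk=(40, 8, 16, 8))
```

Because the two blocks of each feature need not touch, such features can relate a contour region to a distant inner or mirrored region, which a pair of neighboring blocks cannot do.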
Learning Multilayer Channel Features for Pedestrian Detection
Pedestrian detection based on the combination of Convolutional Neural Network
(i.e., CNN) and traditional handcrafted features (i.e., HOG+LUV) has achieved
great success. Generally, HOG+LUV are used to generate the candidate proposals
and then CNN classifies these proposals. Despite its success, there is still
room for improvement. For example, the CNN classifies these proposals using the
fully connected layer features, while the proposal scores and the features in the
inner layers of the CNN are ignored. In this paper, we propose a unifying framework
called Multilayer Channel Features (MCF) to overcome this drawback. It first
integrates HOG+LUV with each layer of the CNN into multi-layer image channels.
Based on the multi-layer image channels, a multi-stage cascade AdaBoost is then
learned. The weak classifiers in each stage of the multi-stage cascade are
learned from the image channels of the corresponding layer. With more abundant
features, MCF achieves the state-of-the-art result on the Caltech pedestrian
dataset (i.e., a 10.40% miss rate). Using new and accurate annotations, MCF
achieves a 7.98% miss rate. As many non-pedestrian detection windows can be
quickly rejected by the first few stages, detection is accelerated by 1.43 times.
By eliminating highly overlapped detection windows with lower scores after
the first stage, it is 4.07 times faster with negligible performance loss.
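A schematic sketch (not the released MCF code) of the multilayer-channel idea: channels from several "layers" are computed lazily, cheap handcrafted channels first and CNN feature maps afterwards, and each cascade stage scores a proposal using only the channels of its own layer, so most windows are rejected before the expensive layers are ever computed. The channel extractors and stage scorers below are toy stand-ins.

```python
import numpy as np

def cascade_score(proposal, layer_channel_fns, stage_scorers, stage_thresholds):
    """Return (accepted, score); stop as soon as one stage rejects the proposal."""
    total = 0.0
    for compute_channels, score_stage, thr in zip(
            layer_channel_fns, stage_scorers, stage_thresholds):
        channels = compute_channels(proposal)     # only computed if earlier stages passed
        total += score_stage(channels)            # boosted score contributed by this stage
        if total < thr:                           # early rejection skips the deeper layers
            return False, total
    return True, total

# toy stand-ins: a "handcrafted" channel and a "deeper" gradient-like channel
layer_fns = [lambda p: p.mean(axis=2),
             lambda p: np.abs(np.gradient(p.mean(axis=2))[0])]
scorers = [lambda c: float(c.mean()) - 0.5,
           lambda c: float(c.std()) - 0.1]
accepted, s = cascade_score(np.random.rand(64, 32, 3), layer_fns, scorers, [-0.2, 0.0])
```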
Learning Sampling Distributions for Efficient Object Detection
Object detection is an important task in computer vision and learning
systems. Multistage particle windows (MPW), proposed by Gualdi et al., is an
algorithm for fast and accurate object detection. By sampling particle windows
from a proposal distribution (PD), MPW avoids exhaustively scanning the image.
Despite its success, it is unknown how to determine the number of stages and
the number of particle windows in each stage. Moreover, it has to generate too
many particle windows in the initialization step, and it redraws unnecessarily
many particle windows around object-like regions. In this paper, we attempt to
solve these problems of MPW. An important fact we use is that a randomly
generated particle window is very unlikely to contain the object, because the
object is a sparse event relative to the huge number of candidate
windows. Therefore, we design the proposal distribution so as to efficiently
reject the huge number of non-object windows. Specifically, we propose the
concepts of rejection, acceptance, and ambiguity windows and regions. This
contrasts with MPW, which utilizes only one region of support. The PD of MPW is
acceptance-oriented, whereas the PD of our method (called iPW) is
rejection-oriented. Experimental results on human and face detection
demonstrate the efficiency and effectiveness of the iPW algorithm. The source
code is publicly accessible.
Comment: 14 pages, 13 figures
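A simplified sketch of the rejection-oriented sampling loop: windows scoring below a lower threshold mark rejection regions and are discarded, windows above an upper threshold are accepted, and only the ambiguous windows spawn new particle windows nearby. The thresholds, Gaussian perturbation, and scoring function are illustrative assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

def iterate_particles(windows, score_fn, low=-1.0, high=1.0, per_parent=5, sigma=8.0):
    """One round: keep accepted windows, drop rejected ones, resample around ambiguous ones."""
    scores = np.array([score_fn(w) for w in windows])
    accepted = windows[scores >= high]             # acceptance windows: keep as detections
    ambiguous = windows[(scores > low) & (scores < high)]
    # rejection windows (scores <= low) are discarded; no particles are redrawn there
    children = ambiguous[:, None, :] + rng.normal(0.0, sigma,
                                                  (len(ambiguous), per_parent, 4))
    return accepted, children.reshape(-1, 4)

windows = rng.uniform(0, 640, size=(500, 4))       # (x, y, w, h) particle windows
toy_score = lambda w: 1.0 - np.linalg.norm(w[:2] - 300) / 100   # peak near (300, 300)
detections, next_round = iterate_particles(windows, toy_score)
```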
Simultaneously Learning Neighborship and Projection Matrix for Supervised Dimensionality Reduction
Explicitly or implicitly, most dimensionality reduction methods need to
determine which samples are neighbors and the similarity between those neighbors
in the original high-dimensional space. The projection matrix is then learned on
the assumption that the neighborhood information (e.g., the similarity) is
known and fixed prior to learning. However, it is difficult to precisely
measure the intrinsic similarity of samples in high-dimensional space because
of the curse of dimensionality. Consequently, the neighbors selected according
to such similarity, and the projection matrix obtained according to such
similarity and neighbors, might not be optimal in the sense of classification and
generalization. To overcome these drawbacks, in this paper we propose to let the
similarity and neighbors be variables and model them in low-dimensional space.
Both the optimal similarity and projection matrix are obtained by minimizing a
unified objective function. Nonnegative and sum-to-one constraints on the
similarity are adopted. Instead of empirically setting the regularization
parameter, we treat it as a variable to be optimized. It is interesting that
the optimal regularization parameter is adaptive to the neighbors in
low-dimensional space and has an intuitive meaning. Experimental results on the
YALE B, COIL-100, and MNIST datasets demonstrate the effectiveness of the
proposed method.
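A condensed sketch of the alternating idea, under simplifying assumptions (the supervised label term and the adaptive regularization-parameter update are omitted, and gamma plays the role of a fixed regularizer): the neighbor similarities S are learned in the projected space with nonnegative, sum-to-one rows, and the projection W is then re-learned from S.

```python
import numpy as np

def simplex_projection(v):
    """Euclidean projection of a vector onto the probability simplex (nonnegative, sum to 1)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def alternate(X, dim, gamma=1.0, iters=10):
    n, d = X.shape
    W = np.linalg.qr(np.random.randn(d, dim))[0]              # random orthonormal init
    for _ in range(iters):
        Z = X @ W                                             # low-dimensional embedding
        D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
        S = np.array([simplex_projection(-D[i] / (2 * gamma)) for i in range(n)])
        A = (S + S.T) / 2                                     # symmetrized similarity
        L = np.diag(A.sum(1)) - A                             # graph Laplacian from S
        vals, vecs = np.linalg.eigh(X.T @ L @ X)              # keep directions with the
        W = vecs[:, :dim]                                     # smallest neighbor spread
    return W, S

W, S = alternate(np.random.randn(100, 20), dim=5)
```

The point of the sketch is the coupling: S depends on distances in the projected space, and W depends on the graph built from S, so neither is fixed in advance.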
Stacked Semantic-Guided Attention Model for Fine-Grained Zero-Shot Learning
Zero-Shot Learning (ZSL) is achieved by aligning the semantic relationships
between the global image feature vector and the corresponding class semantic
descriptions. However, using global features to represent fine-grained
images may lead to sub-optimal results, since they neglect the discriminative
differences of local regions. Moreover, different regions contain distinct
discriminative information, and the important regions should contribute more to
the prediction. To this end, we propose a novel stacked semantics-guided attention
(S2GA) model that obtains semantics-relevant features by using individual class
semantic features to progressively guide the visual features to generate an
attention map for weighting the importance of different local regions. Feeding
both the integrated visual features and the class semantic features into a
multi-class classification architecture, the proposed framework can be trained
end-to-end. Extensive experimental results on the CUB and NABird datasets show that
the proposed approach yields a consistent improvement on both fine-grained
zero-shot classification and retrieval tasks.
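A bare-bones illustration of a single semantics-guided attention hop (the paper stacks several such hops; the dimensions and the bilinear scoring form below are assumptions): class semantic features steer the weights placed on local region features before classification.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def semantic_guided_attention(region_feats, semantic_vec, W_v, W_s):
    """region_feats: (R, d_v) local features; semantic_vec: (d_s,) class description."""
    scores = (region_feats @ W_v) @ (W_s @ semantic_vec)   # relevance of each region
    alpha = softmax(scores)                                 # attention over regions
    attended = alpha @ region_feats                         # semantically weighted feature
    return attended, alpha

R, d_v, d_s, d_h = 49, 512, 312, 128
regions = np.random.randn(R, d_v)
semantics = np.random.randn(d_s)
att, alpha = semantic_guided_attention(
    regions, semantics, np.random.randn(d_v, d_h), np.random.randn(d_h, d_s))
```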
Video Summarization with Attention-Based Encoder-Decoder Networks
This paper addresses the problem of supervised video summarization by
formulating it as a sequence-to-sequence learning problem, where the input is a
sequence of original video frames and the output is a keyshot sequence. Our key
idea is to learn a deep summarization network with an attention mechanism to mimic
the way humans select keyshots. To this end, we propose a novel
video summarization framework named Attentive encoder-decoder networks for
Video Summarization (AVS), in which the encoder uses a Bidirectional Long
Short-Term Memory (BiLSTM) to encode the contextual information among the input
video frames. As for the decoder, two attention-based LSTM networks are
explored by using additive and multiplicative objective functions,
respectively. Extensive experiments are conducted on two benchmark video
summarization datasets, i.e., SumMe and TVSum. The results demonstrate the
superiority of the proposed AVS-based approaches over state-of-the-art
approaches, with remarkable improvements from 0.8% to 3% on the two
datasets, respectively.
Comment: 9 pages, 7 figures
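A toy sketch of the encoder-decoder layout (layer sizes are arbitrary, and the decoder is reduced to a single attention read so the additive and multiplicative scoring forms can be shown side by side; this is not the paper's full AVS model):

```python
import torch
import torch.nn as nn

class TinyAVS(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, attention="additive"):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attention = attention
        self.W = nn.Linear(2 * hidden, 2 * hidden, bias=False)   # multiplicative form
        self.Ua = nn.Linear(2 * hidden, 2 * hidden, bias=False)  # additive form
        self.v = nn.Linear(2 * hidden, 1, bias=False)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, frames):                      # frames: (B, T, feat_dim)
        H, _ = self.encoder(frames)                 # BiLSTM states: (B, T, 2*hidden)
        query = H.mean(dim=1, keepdim=True)         # crude stand-in for the decoder state
        if self.attention == "additive":
            e = self.v(torch.tanh(self.Ua(H) + query))           # (B, T, 1)
        else:                                       # multiplicative (dot-product style)
            e = torch.bmm(self.W(H), query.transpose(1, 2))      # (B, T, 1)
        alpha = torch.softmax(e, dim=1)             # attention over frames
        context = (alpha * H).sum(dim=1)            # attended video representation
        return torch.sigmoid(self.out(context))     # importance score in this toy version

scores = TinyAVS()(torch.randn(2, 120, 1024))
```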
Query-Aware Sparse Coding for Multi-Video Summarization
Given the explosive growth of online videos, it is becoming increasingly
important to relieve the tedious work of browsing and managing the video
content of interest. Video summarization aims at providing such a technique by
transforming one or multiple videos into a compact one. However, conventional
multi-video summarization methods often fail to produce satisfying results as
they ignore the user's search intent. To this end, this paper proposes a novel
query-aware approach by formulating the multi-video summarization in a sparse
coding framework, where the web images searched by the query are taken as the
important preference information to reveal the query intent. To provide a
user-friendly summarization, this paper also develops an event-keyframe
presentation structure to present keyframes in groups of specific events
related to the query by using an unsupervised multi-graph fusion method. We
release a new public dataset named MVS1K, which contains about 1,000 videos
from 10 queries, together with their video tags, manual annotations, and
associated web images. Extensive experiments on the MVS1K dataset validate that
our approach produces superior objective and subjective results compared with
several recently proposed approaches.
Comment: 10 pages, 8 figures
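A highly simplified sketch of query-aware sparse coding (plain ISTA on an l1-regularized reconstruction; the paper's formulation and grouping are richer): candidate keyframes form the dictionary, and both the video frames and the query-related web-image features must be reconstructed, so frames relevant to the query earn larger coefficients.

```python
import numpy as np

def ista(D, Y, lam=0.1, step=None, iters=200):
    """Solve min_A 0.5 * ||Y - D @ A||_F^2 + lam * ||A||_1 with proximal gradient."""
    step = step or 1.0 / np.linalg.norm(D, 2) ** 2              # 1 / Lipschitz constant
    A = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(iters):
        G = A - step * D.T @ (D @ A - Y)                        # gradient step
        A = np.sign(G) * np.maximum(np.abs(G) - step * lam, 0)  # soft-thresholding
    return A

d, n_frames, n_query = 256, 400, 30
frames = np.random.randn(d, n_frames)                # candidate keyframe features (dictionary)
targets = np.hstack([frames, np.random.randn(d, n_query)])  # video frames + query web images
A = ista(frames, targets)
keyframe_ids = np.argsort(-np.abs(A).sum(axis=1))[:10]      # frames used most in reconstruction
```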
Cascaded Subpatch Networks for Effective CNNs
Conventional Convolutional Neural Networks (CNNs) use either a linear or a
non-linear filter to extract features from an image patch (region) of spatial
size $H\times W$ (typically, $H$ is small and is equal to $W$, e.g., $H$ is 5 or 7).
Generally, the size of the filter is equal to the size $H\times W$ of the input
patch. We argue that the representation ability of this equal-size
strategy is not strong enough. To overcome the drawback, we propose to use a
subpatch filter whose spatial size $h\times w$ is smaller than $H\times W$.
The proposed subpatch filter consists of two subsequent filters. The first one
is a linear filter of spatial size $h\times w$ and is aimed at extracting
features from the spatial domain. The second one is of spatial size $1\times 1$
and is used for strengthening the connection between different input feature
channels and for reducing the number of parameters. The subpatch filter
convolves with the input patch and the resulting network is called a subpatch
network. Taking the output of one subpatch network as input, we further repeat
constructing subpatch networks until the output contains only one neuron in the
spatial domain. These subpatch networks form a new network called a Cascaded
Subpatch Network (CSNet). The feature layer generated by CSNet is called csconv
layer. For the whole input image, we construct a deep neural network by
stacking a sequence of csconv layers. Experimental results on four benchmark
datasets demonstrate the effectiveness and compactness of the proposed CSNet.
For example, our CSNet reaches a test error of on the CIFAR10
dataset without model averaging. To the best of our knowledge, this is the best
result ever obtained on the CIFAR10 dataset.
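A minimal sketch of a csconv-style layer (channel widths, kernel sizes, and the number of cascaded units are invented for the example): each subpatch unit is an $h\times w$ convolution followed by a $1\times 1$ convolution, and units are stacked until the spatial output over the patch shrinks to a single neuron.

```python
import torch
import torch.nn as nn

def subpatch_unit(c_in, c_out, k):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k), nn.ReLU(inplace=True),   # h x w filter
        nn.Conv2d(c_out, c_out, kernel_size=1), nn.ReLU(inplace=True),  # 1 x 1 filter
    )

class CsconvLayer(nn.Module):
    """Cascade subpatch units over a patch until only one spatial position remains."""
    def __init__(self, c_in, c_mid, patch=7, k=3):
        super().__init__()
        units, size, c = [], patch, c_in
        while size > 1:
            kk = min(k, size)                 # the last unit may need a smaller kernel
            units.append(subpatch_unit(c, c_mid, kk))
            size, c = size - kk + 1, c_mid
        self.units = nn.Sequential(*units)

    def forward(self, x):                     # x: (B, c_in, patch, patch)
        return self.units(x)                  # (B, c_mid, 1, 1)

out = CsconvLayer(c_in=3, c_mid=64)(torch.randn(8, 3, 7, 7))
```

Applied convolutionally over a full image rather than a single patch, a stack of such layers would play the role of the csconv layers described above.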
Transductive Zero-Shot Learning with Adaptive Structural Embedding
Zero-shot learning (ZSL) endows the computer vision system with the
inferential capability to recognize instances of a new category that has never
been seen before. Two fundamental challenges in ZSL are visual-semantic embedding and
domain adaptation, arising in the cross-modality learning and unseen class
prediction steps, respectively. To address both challenges, this paper presents
two corresponding methods named Adaptive STructural Embedding (ASTE) and
Self-PAced Selective Strategy (SPASS), respectively. Specifically, ASTE formulates the
visual-semantic interactions in a latent structural SVM framework and adaptively
adjusts the slack variables to embody the different reliability of training
instances. In this way, reliable instances incur small
penalties, whereas less reliable instances incur more severe
penalties. Thus, it ensures a more discriminative embedding. On the other
hand, SPASS offers a framework to alleviate the domain shift problem in ZSL,
which exploits the unseen data in an easy-to-hard fashion. In particular, SPASS
borrows the idea of self-paced learning by iteratively selecting the unseen
instances from reliable to less reliable, to gradually adapt the knowledge from
the seen domain to the unseen domain. Subsequently, by combining SPASS and
ASTE, we present a self-paced Transductive ASTE (TASTE) method to progressively
reinforce the classification capacity. Extensive experiments on three benchmark
datasets (i.e., AwA, CUB, and aPY) demonstrate the superiorities of ASTE and
TASTE. Furthermore, we also propose a fast training (FT) strategy to improve
the efficiency of most existing ZSL methods. The FT strategy is surprisingly
simple and general, and it can speed up the training of most
existing methods by 4 to 300 times while maintaining their previous performance.
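An abstract sketch of the self-paced selective strategy (the classifier, confidence measure, and pacing schedule are all placeholders, not the paper's formulation): unseen instances are folded into the adaptation pool from the most to the least reliable, gradually transferring the model from the seen to the unseen domain.

```python
import numpy as np

def self_paced_adaptation(X_unseen, predict_proba, update_model, rounds=5, start=0.1):
    """X_unseen: (N, d) array; predict_proba and update_model are user-supplied callables."""
    selected = np.zeros(len(X_unseen), dtype=bool)
    for t in range(rounds):
        proba = predict_proba(X_unseen)                        # (N, n_unseen_classes)
        confidence, pseudo = proba.max(axis=1), proba.argmax(axis=1)
        quota = int(len(X_unseen) * min(1.0, start * (t + 1))) # easy-to-hard pacing
        order = np.argsort(-confidence)
        selected[:] = False
        selected[order[:quota]] = True                         # most reliable instances first
        update_model(X_unseen[selected], pseudo[selected])     # refit with pseudo-labels
    return pseudo, selected
```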